Abstract

In this short paper, the Electre Tri machine learning method, generally
used to solve ordinal classification problems, is proposed for solving
the Record Linkage problem. Preliminary experimental results show that
the Electre Tri method can achieve high accuracy: more than 99% of the
matches and nonmatches were correctly identified by the procedure.


1 Introduction 

Machine learning is a scientific discipline concerned with the design
and development of algorithms that allow computers to “learn” from data. More
precisely, “learning” is here intended as the ability to automatically recognize
complex patterns and make “intelligent” decisions based on data.
Hence, machine learning is closely related to fields such as statistics, probability
theory, data mining, pattern recognition, artificial intelligence, adaptive control
and theoretical computer science.

Machine learning algorithms can be classified into the following types:

• supervised learning algorithms: a function/classifier that maps inputs to
outputs is generated from labeled input-output examples;

• unsupervised learning algorithms: patterns in the input are recognized;
the examples have no labels;

• semi-supervised learning algorithms: supervised and unsupervised learning
information is combined;

• reinforcement learning: actions are generated from observation of the
world. Every action has some impact on the environment, and the environment
provides feedback that is translated into a score that guides the
learning process.

The principal supervised learning techniques currently applied or under
consideration at statistical agencies worldwide to solve the record linkage
matching problem are classification trees, support vector machines and neural
networks. In this short paper, another machine learning technique is
proposed to solve the record linkage problem: the multi-criteria classification
method Electre Tri. It is the first time that a multi-criteria machine learning
technique is used to solve the record linkage problem.

This application addresses one of the “many challenges in applying supervised
machine learning to record linkage matching” [10], showing that the use of
the multi-criteria classification method Electre Tri to solve the record linkage
problem provides good results in terms of classification model performance. The
importance of this application lies in the increasing use of data from
administrative sources. In this context, an important problem is that
of finding matching pairs of records from heterogeneous databases while
maintaining the privacy of the database parties. To this purpose, secure
computation of distance metrics is important for secure record linkage [5].

The paper is organized as follows. Section 2 introduces the
Record Linkage problem; Section 3 describes the Electre Tri method,
used to solve the Record Linkage problem; and Section 4 reports a preliminary
experiment conducted on simulated data. The paper closes with some final
remarks and conclusions.


2 Linked Data: the Record Linkage 

Generally speaking, in the integration of two data sets the objective is the
detection of those records, in the different data sets, that belong to the same
statistical unit. This allows the reconstruction of a unique record that
contains all the information collected on that unit from the different data
sources.

Therefore, record linkage is the methodology of bringing together
corresponding records from two or more files or finding duplicates within
files [16]. For the first situation, the definition of record linkage in [9] is
more precise: “Record linkage is a solution to the problem of recognizing those
records in two files which represent identical persons, objects, or events
(said to be matched)”.

The term record linkage originated in the public health area, when files of
individual patients were brought together using name, date of birth and other
information.

One of the main motivations for using the record linkage method is
the construction of large databases to meet new information needs.

In order to better understand the problem, a small practical example is now
presented. Suppose the user wants to link two datasets of persons, A and B, for
which the variables Name, Address and Age are known.

Suppose that Table A contains the following values:


Table A: Data in the first dataset 


Unit  Name             Address               Age
a1    John A Smith     16 Main Street        16
a2    Javier Martinez  49 E Applecross Road  33
a3    Gillian Jones    645 Reading Aev       22


Furthermore, suppose that Table B contains the following values:


Table B: Data in the second dataset 


Unit  Name              Address            Age
b1    J H Smith         16 Main St         17
b2    Haveir Marteenez  49 Aplecross Raod  36
b3    Jilliam Brown     123 Norcross Blvd  43


The matching table A x B contains two pairs of units that probably refer to
the same persons and that the method should identify as matches: ’John A Smith’
with ’J H Smith’, and ’Javier Martinez’ with ’Haveir Marteenez’.
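
To make the example concrete, the sketch below compares every pair of records
in A x B field by field. It is only an illustration: the paper does not state
which distance measure was used, so the string similarity (Python's
difflib.SequenceMatcher) and the age similarity (a linear decay capped at a
hypothetical 10-year gap) are assumptions of this sketch, as are all function
and variable names.

```python
from difflib import SequenceMatcher

# Records from Table A and Table B: (Name, Address, Age).
table_a = {
    "a1": ("John A Smith", "16 Main Street", 16),
    "a2": ("Javier Martinez", "49 E Applecross Road", 33),
    "a3": ("Gillian Jones", "645 Reading Aev", 22),
}
table_b = {
    "b1": ("J H Smith", "16 Main St", 17),
    "b2": ("Haveir Marteenez", "49 Aplecross Raod", 36),
    "b3": ("Jilliam Brown", "123 Norcross Blvd", 43),
}


def similarity(x: str, y: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def compare(rec_a, rec_b, max_age_gap: int = 10):
    """Return (name, address, age) similarities for one pair of records."""
    name = similarity(rec_a[0], rec_b[0])
    address = similarity(rec_a[1], rec_b[1])
    # Assumed age similarity: 1 for equal ages, 0 at max_age_gap or more.
    age = 1.0 - min(abs(rec_a[2] - rec_b[2]), max_age_gap) / max_age_gap
    return name, address, age


# One comparison vector per pair (a, b) in A x B.
pairs = {
    (i, j): compare(rec_a, rec_b)
    for i, rec_a in table_a.items()
    for j, rec_b in table_b.items()
}

for key, vector in sorted(pairs.items()):
    print(key, tuple(round(v, 2) for v in vector))
```

Each comparison vector can then be read as the performance of one alternative
(a pair of records) on three criteria.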

Modern record linkage begins with the pioneering work of Newcombe et
al., who introduced odds ratios of frequencies and the decision rules for
delineating matches and nonmatches. In recent years, advances have yielded
computer systems that incorporate sophisticated ideas from computer science,
statistics and operational research [16].

Then, Fellegi and Sunter [9] introduced a mathematical foundation for record
linkage. Their theory demonstrated the optimality of the decision rules used by
Newcombe and introduced a variety of ways of estimating crucial matching
probabilities (parameters) directly from the files being matched.

Formally, given two files A and B to be matched, each pair
$(a, b) \in \Gamma = A \times B$ has to be classified as a true match or a
true nonmatch.

The odds ratio of probabilities is:

\[
R = \frac{\Pr(\gamma \in \Gamma \mid M)}{\Pr(\gamma \in \Gamma \mid U)}
\]

where $\gamma$ is an arbitrary agreement pattern in the comparison space
$\Gamma$, $M$ is the set of true matches and $U$ is the set of true nonmatches.
Between these two sets lies the intermediate set of possible matches.

The decision rule reported below helps to classify the pairs: 

• if R > Upper, then the pair (a, b) is a designated match,

• if Lower ≤ R ≤ Upper, then the pair (a, b) is a designated potential match,

• if R < Lower, then the pair (a, b) is a designated nonmatch.
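
The decision rule above can be sketched in a few lines. The function names and
the numeric thresholds are hypothetical, and treating a ratio exactly equal to
a threshold as a potential match is an assumption of this sketch.

```python
def odds_ratio(p_gamma_given_m: float, p_gamma_given_u: float) -> float:
    """R = Pr(gamma | M) / Pr(gamma | U) for one agreement pattern gamma."""
    return p_gamma_given_m / p_gamma_given_u


def classify_pair(r: float, lower: float, upper: float) -> str:
    """Apply the three-way decision rule to the odds ratio r."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if r > upper:
        return "match"
    if r < lower:
        return "nonmatch"
    # Remaining case: lower <= r <= upper (boundary handling is assumed).
    return "potential match"


# Illustrative probabilities and thresholds, not taken from the paper.
r = odds_ratio(0.9, 0.01)
print(classify_pair(r, lower=0.5, upper=20.0))
```

The three return values correspond to the three sets of designated matches,
potential matches and nonmatches.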

Estimating the thresholds Upper and Lower in an objective way is not easy;
the choice is left to the analyst. The decision rule creates three different
sets: the designated matches, the designated potential matches and the
designated nonmatches. They constitute a partition of the set of all the
records in the space $\Gamma$ into three subsets $C_3$ (matches), $C_2$
(potential matches) and $C_1$ (nonmatches), whose pairwise intersections are
empty.

The idea is to solve the record linkage problem as a multi-criteria
classification problem, whose a priori defined classes are the subsets of the
partition.

Without going into too much detail, the next section gives a brief introduction
to the Electre Tri method.

3 The multi-criteria method Electre Tri 

In Multi Criteria Decision Aid, a finite set of objects (alternatives, actions,
projects) is evaluated by a finite set of criteria, which measure their
performances. A criterion is a real-valued function $g_j : A \to \mathbb{R}$,
such that $g_j(a_k)$ indicates the performance of the alternative $a_k$ on the
criterion $g_j$. The comparison of any pair of alternatives $a_i$ and $a_k$ may
be grounded on the comparison of the two values $g_j(a_i)$ and $g_j(a_k)$ [15].

In general, a criterion can be of either gain or cost type; gain means that the
Decision Maker (DM) prefers the highest value, while cost means that the DM
prefers the lowest value on the criterion.

Many types of criteria have been studied in the literature, such as the
true-criterion, pseudo-criterion, pre-criterion, semi-criterion and other
types [13].

In the case of a true-criterion, if the difference between two performances is
positive, then the true-criterion structure implies that the alternatives are
in a strict preference relation, while if the difference is equal to 0, they
are in an indifference relation.

Electre Tri is a pseudo-criterion-based method. This type of criterion
takes into account that data can be affected by errors due to uncertainty and
imprecision, so that small and large differences in performance cannot imply
the same binary relation. To define “small” and “large”, two values are
considered: the indifference and preference thresholds.

In the literature, grouping problems are divided into clustering,
classification and sorting problems, depending on the a priori/a posteriori
knowledge of the classes. The sorting problem is a classification problem,
dealt with by a multi-criteria approach, that requires some preference
information from the Decision Maker (DM). The aim of an ordinal sorting problem
is thus to assign each alternative to one of the ordered predefined categories.

Formally, given $p$ predefined ordered categories $C_1, C_2, \ldots, C_p$ and a
finite set of $n$ alternatives $A = \{a_1, a_2, \ldots, a_n\}$ evaluated on a
finite set of $m$ criteria $G = \{g_1, g_2, \ldots, g_m\}$, in the case where
all criteria are of gain type, the relations among the categories are
$C_1 \prec C_2 \prec \cdots \prec C_p$, where $b_h$ is the profile that is the
upper limit of category $C_h$ and the lower limit of category $C_{h+1}$. In
this way, $C_1$ and $C_p$ are the worst and the best categories respectively.

The Electre Tri method is based on outranking relations, indicated with $S$,
which characterize how the alternatives compare with the profiles: the
assignment of an alternative to a specific category follows from the
comparison, on all criteria, of its performances with those of the profiles.

The relation $a S b_h$ validates or invalidates the assertion “$a$ outranks
$b_h$”, whose meaning is “$a$ is at least as good as $b_h$” on the set $G$.

In the context of the Electre Tri method, the outranking relation is validated
by computing four indices:

1. the partial concordance indices on each criterion;

2. the global concordance index on all the criteria;

3. the partial discordance indices on each criterion;

4. the credibility index on all the criteria.

To compute the partial concordance indices, it is necessary to know the values
of the profiles and of the preference and indifference thresholds. If any one of
these parameters is not known, the index cannot be computed. To compute
the global concordance index, it is necessary to know the weights,
which represent the importance coefficients of the criteria. To compute
the partial discordance indices, it is necessary to know the values of the
profiles and of the preference and veto thresholds. The credibility index
corresponds to the global concordance index weakened by veto effects. If veto
thresholds do not enter into the model, the credibility index is equal to the
global concordance index. To pass from the credibility index to the definition
of an outranking relation, it is necessary to fix a cutting level $\lambda$,
which is the minimum value of the credibility index
for which the outranking relation holds. Finally, the assignment of
an alternative to a category does not result from the outranking relation
directly; it is necessary to use one (or both) of the two proposed exploitation
procedures, namely the pessimistic and the optimistic assignment procedures.
These procedures analyze the way an alternative compares to the profiles so as
to determine the category to which the alternative should be assigned.
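
The chain from partial concordance to assignment can be sketched as follows for
the simplified case without veto thresholds, in which the credibility index
equals the global concordance index. The formulas are the standard Electre Tri
ones for gain-type criteria; the function names, the per-criterion (rather than
per-profile) thresholds and the numeric values are assumptions of this sketch,
not the configuration used in the experiment.

```python
def partial_concordance(g_a: float, g_b: float, q: float, p: float) -> float:
    """c_j(a, b_h) for a gain criterion, with indifference q <= preference p."""
    diff = g_b - g_a  # how far the profile exceeds the alternative
    if diff <= q:
        return 1.0
    if diff >= p:
        return 0.0
    return (p - diff) / (p - q)  # linear interpolation between q and p


def credibility(a, profile, weights, q, p) -> float:
    """Weighted global concordance; equals credibility when no veto is used."""
    total = sum(weights)
    score = sum(
        w * partial_concordance(ga, gb, qj, pj)
        for ga, gb, w, qj, pj in zip(a, profile, weights, q, p)
    )
    return score / total


def pessimistic_assignment(a, profiles, weights, q, p, lam) -> int:
    """Return the category index (1 = worst) of alternative a.

    profiles[h - 1] holds the profile b_h, ordered from worst to best.
    The alternative is compared with b_{p-1}, ..., b_1 and assigned to
    C_{h+1} for the first profile b_h that it outranks at level lam.
    """
    for h in range(len(profiles), 0, -1):
        if credibility(a, profiles[h - 1], weights, q, p) >= lam:
            return h + 1
    return 1


# Illustrative setting: three criteria in [0, 1], three categories.
profiles = [[0.3, 0.3, 0.3], [0.7, 0.7, 0.7]]
weights = [1.0, 1.0, 1.0]
q = [0.05, 0.05, 0.05]
p = [0.10, 0.10, 0.10]
print(pessimistic_assignment([0.9, 0.8, 0.95], profiles, weights, q, p, 0.7))
```

Raising the cutting level makes outranking harder to establish, so the
pessimistic procedure tends to assign borderline alternatives to lower
categories.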

One of the main difficulties is the elicitation of the various parameters,
which in Electre Tri are the profiles, weights, thresholds (preference,
indifference and veto) and cutting level $\lambda$. Even if these parameters
can be interpreted, it can be difficult to fix their values directly (direct
elicitation) and to have a clear global understanding of the implications of
these values in terms of the output.

In order to estimate the values of the parameters indirectly, De Leone and
Minnetti [6] proposed a new estimation methodology whose procedure is composed
of two phases: the first dedicated to the estimation of the profiles and
thresholds, the second to the estimation of the weights and cutting level. The
core of the procedure is the estimation of the profiles, carried out with
Linear Programming (LP) using a training set.

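Let $p$ be the number of categories and $m$ the number of criteria. The LP
problem is the following:

\[
\begin{array}{lll}
\min & \displaystyle\sum_{j=1}^{m} \sum_{a_k \to C_h} \theta_j(a_k) & \\[2ex]
\text{s.t.} & \theta_j(a_k) \geq g_j(a_k) - g_j(b_h) & \forall j = 1,\ldots,m,\ \forall a_k \to C_h,\ h \neq p \\
 & \theta_j(a_k) \geq g_j(b_{h-1}) - g_j(a_k) & \forall j = 1,\ldots,m,\ \forall a_k \to C_h,\ h \neq 1 \\
 & g_j(b_h) \geq g_j(b_{h-1}) + \varepsilon & \forall j = 1,\ldots,m,\ \forall h = 2,\ldots,p-1 \\
 & \theta_j(a_k) \geq 0 & \forall j = 1,\ldots,m,\ \forall a_k \to C_h
\end{array}
\tag{1}
\]

where $\varepsilon$ is a small positive value and $a_k \to C_h$ denotes that
the training alternative $a_k$ is assigned to category $C_h$.

Problem (1) minimizes the sum of the classification errors $\theta_j(a_k)$ over
all criteria and all alternatives in the training set; an error is incurred
when an alternative's performance lies outside the category to which it
belongs. The first two constraints define the error $\theta_j(a_k)$.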
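
To clarify what the objective of problem (1) measures, the sketch below
evaluates it for a fixed, given set of profiles: for each training alternative
and criterion, the smallest error satisfying the constraints is the amount by
which the performance exceeds the upper profile of its category or falls below
the lower one. This is only an evaluation of the objective under that
assumption; the LP itself also optimizes over the profile values, which is not
done here. All names and numbers are hypothetical.

```python
def classification_errors(alternatives, categories, profiles) -> float:
    """Sum of the errors over all criteria j and training alternatives a_k.

    alternatives: list of performance vectors [g_1(a_k), ..., g_m(a_k)]
    categories:   list of category indices h in 1..p, one per alternative
    profiles:     list of p - 1 vectors, profiles[h - 1] = profile b_h
    """
    p = len(profiles) + 1
    total = 0.0
    for g_a, h in zip(alternatives, categories):
        for j, value in enumerate(g_a):
            theta = 0.0  # errors are nonnegative
            if h != p:
                # Performance above the upper limit b_h of category C_h.
                theta = max(theta, value - profiles[h - 1][j])
            if h != 1:
                # Performance below the lower limit b_{h-1} of category C_h.
                theta = max(theta, profiles[h - 2][j] - value)
            total += theta
    return total


# One criterion, three categories separated by profiles at 0.3 and 0.7.
profiles = [[0.3], [0.7]]
print(classification_errors([[0.2], [0.5], [0.9]], [1, 2, 3], profiles))
```

A total of zero means that every training alternative lies within the limits of
its own category on every criterion.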


4 Application to Real Data: a first experiment 

As said in the previous section, the multi-criteria approach requires some
preference information from the DM, including binary relations. Since it is
possible to state binary relations between the subsets as
$C_3 \succ C_2 \succ C_1$, the record linkage problem can be structured as an
ordinal sorting problem, that is, a classification problem whose classes are
ordered by strict preference relations.

Moreover, the advantage of multi-criteria decision methods over
other classification methods lies in the possibility of assigning weights to
each criterion, which is not possible in all classification methods, and of
using the preference information provided by the DM to estimate the parameters
of the classification model.

The proposed application aims to find a classification model (i.e. a classifier
or learner) that assigns each record of the space $\Gamma$ to one of the three
categories $C_1$, $C_2$ and $C_3$, following the two-phase procedure formulated
by De Leone and Minnetti [6].

The input data used in the application are Winkler's American Census data
(included in the SecondString data files for approximate string matching
techniques). Two data sets A and B are considered, containing 449 and 392
records respectively, with 327 true links.

The variables (textual fields from synthetic census data) are the following:

• DS (labels of the data sets, A and B);

• IDENTIFIER;

• SURNAME;

• NAME;

• LASTCODE (middle name initial);

• NUMCODE (address street number);

• STREET (address street name).

In this short paper, results from preliminary experiments are reported,
because the application is, owing to its complexity, still ongoing research.

Some variables contain missing values, which complicate the analysis.
To simplify the analysis, the records with missing values were therefore
deleted.

There are a number of popular methods for estimating a learner's ability
to generalize; the test set method was used here. In this experiment, the
choice of distance measure and the search for the training set played the most
important roles; they contributed to the good results of the classification
model found by Electre Tri [17].

The performance of the classification model, applied to the test set (83868
alternatives), was 99.09% when all the criteria have the same importance and the
cutting level is $\lambda = 0.50$. As $\lambda$ increases, the performance
increases, up to 99.81% when $\lambda = 0.70$ and 99.89% when $\lambda = 0.85$.

When the importance coefficients of the criteria were allowed to differ, the
performances of the models remained substantially the same as the lambda
parameter varied.

In the case of performance 99.09%, the classification errors made by
the model concerned the false links; that is, the model recognized almost all
the true links. The opposite situation occurred in the case of performance
99.89%, when the model recognized almost all the false links and misclassified
the true links. The choice of the most suitable model is left to the DM,
depending on his or her preferences.
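
Because two models with similar overall performance can place their errors on
different classes, it can help to report the performance per class as well. The
sketch below shows one way to do this; the function names and the toy labels
are hypothetical and do not reproduce the experimental data.

```python
from collections import Counter


def overall_accuracy(true_labels, predicted_labels) -> float:
    """Share of pairs whose predicted class equals the true class."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def per_class_accuracy(true_labels, predicted_labels) -> dict:
    """Share of correctly classified pairs within each true class."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}


# Toy data: 2 true links and 8 false links; one true link is missed.
true_labels = ["match"] * 2 + ["nonmatch"] * 8
predicted = ["match", "nonmatch"] + ["nonmatch"] * 8

print(overall_accuracy(true_labels, predicted))
print(per_class_accuracy(true_labels, predicted))
```

In this toy case the overall performance is high although half of the true
links are missed, which is the kind of trade-off the DM has to weigh.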

5 Final Conclusion and Remarks 

In this short paper, the Electre Tri machine learning technique was proposed for
solving Record Linkage matching. It is the first time that a multi-criteria
decision technique is used to solve the record linkage problem.

The proposed application started with an initial experiment suggesting
that the application of Electre Tri to record linkage can provide good
results in terms of classifier performance. This paper shows only the results of
a preliminary experiment. The experiment also confirmed that record linkage
is more sensitive to the quality of preprocessing and standardization than to
that of matching, as noted in the record linkage literature.

As a consequence, other distance measures for the construction of the input
data matrix, as well as different schemes for the search of the training set,
will be used.